[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList#50

Draft. rok wants to merge 1 commit into main from vector_repetition_level

@rok rok commented May 15, 2026

This PR prototypes a new experimental Parquet repetition type, VECTOR, mainly for Arrow's FixedSizeList<T, N> as proposed in Option B here.

Today, FixedSizeList is written through Parquet LIST, which means we pay repetition/definition-level costs for an inner shape that is already fixed. VECTOR stores the fixed length in the Parquet schema as vector_length and avoids the inner repetition level.
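
To make that cost concrete, here is a minimal sketch of the levels a LIST encoding carries for fixed-size rows, next to the fixed-stride split that a schema-level vector_length makes possible. It is a simplified model of Dremel-style levels for a required list of required elements, not the code in this PR; `list_levels` and `vector_split` are hypothetical helper names.

```python
from typing import List, Tuple


def list_levels(rows: List[List[float]]) -> Tuple[List[int], List[int], List[float]]:
    """Simplified Dremel-style levels for a required list of required elements.

    Repetition level 0 starts a new row; 1 continues the current list.
    Definition level 1 marks a present element; 0 marks an empty list.
    """
    rep_levels: List[int] = []
    def_levels: List[int] = []
    values: List[float] = []
    for row in rows:
        if not row:
            # An empty list still emits one level pair but no value.
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, value in enumerate(row):
            rep_levels.append(0 if i == 0 else 1)
            def_levels.append(1)
            values.append(value)
    return rep_levels, def_levels, values


def vector_split(values: List[float], vector_length: int) -> List[List[float]]:
    """Rebuild rows from a flat value buffer using only the fixed length."""
    if vector_length <= 0 or len(values) % vector_length != 0:
        raise ValueError("value count must be a multiple of vector_length")
    return [values[i:i + vector_length] for i in range(0, len(values), vector_length)]


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
rep, dfn, flat = list_levels(rows)
print(rep)  # one repetition level per value
print(dfn)  # one definition level per value
print(vector_split(flat, 3) == rows)
```

In this model, every value in a fixed-size row carries a level pair that adds nothing beyond what the schema already states, while the fixed-length split needs only the flat values and vector_length.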

Writing FixedSizeList<T, N> as VECTOR is opt-in via ArrowWriterProperties::Builder::enable_experimental_vector_encoding(); without it, the writer defaults to LIST.

Implemented in this prototype

  • Parquet schema/thrift addition of VECTOR
  • Arrow schema conversion:
    • FixedSizeList -> VECTOR when explicitly enabled
    • VECTOR -> FixedSizeList
  • required and nullable rows in VECTOR
  • nullable primitive elements in VECTOR
  • primitive type vectors
  • limited composability proof of concept: FixedSizeList<struct<x: float, y: int32>, N>
  • roundtrip/schema/path tests
  • read/write/roundtrip benchmarks

Deferred for now:

  • dictionary / non-PLAIN encodings
  • statistics
  • page index
  • general nested VECTOR children (non-fixed-size children will not be possible)

Compatibility

We expect readers that are not VECTOR-aware to fail when they encounter the VECTOR repetition type.

Benchmark snapshot

Both the VECTOR and LIST runs use identical WriterProperties (dictionary, statistics, and page index disabled; PLAIN encoding). The only intended difference is enable_experimental_vector_encoding.

Numbers below are from a local release build on an M4 Mac, so treat them as directional only.

Required FixedSizeList<float, N>: FixedSizeList encoded as LIST vs VECTOR

All runs use FixedSizeList<float, N> with N in {80, 768, 10,000}.

| Vector length | LIST write | VECTOR write | Write speedup | LIST read | VECTOR read | Read speedup | LIST roundtrip | VECTOR roundtrip | Roundtrip speedup |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 3.11M rows/s | 21.36M rows/s | 6.9x | 7.36M rows/s | 123.31M rows/s | 17x | 2.13M rows/s | 17.68M rows/s | 8.3x |
| 768 | 339.52k rows/s | 2.32M rows/s | 6.8x | 876.86k rows/s | 13.06M rows/s | 15x | 247.32k rows/s | 1.86M rows/s | 7.5x |
| 10,000 | 27.41k rows/s | 192.99k rows/s | 7.0x | 66.99k rows/s | 795.37k rows/s | 12x | 19.53k rows/s | 165.72k rows/s | 8.5x |
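
Because the rows differ so much in size, rows/s is hard to compare across vector lengths. A small sketch, using only the write figures from the table above, converts each rate into float values per second and recomputes the speedup; `parse_rate` is a hypothetical helper for the `k`/`M` suffixes used here.

```python
SUFFIXES = {"k": 1e3, "M": 1e6}


def parse_rate(text: str) -> float:
    """Parse a rate such as '339.52k rows/s' into rows per second."""
    number = text.split()[0]
    scale = SUFFIXES.get(number[-1])
    if scale is None:
        return float(number)
    return float(number[:-1]) * scale


# (vector length, LIST write, VECTOR write), copied from the table above.
WRITE_RATES = [
    (80, "3.11M rows/s", "21.36M rows/s"),
    (768, "339.52k rows/s", "2.32M rows/s"),
    (10_000, "27.41k rows/s", "192.99k rows/s"),
]

for length, list_text, vector_text in WRITE_RATES:
    list_rows = parse_rate(list_text)
    vector_rows = parse_rate(vector_text)
    print(
        f"N={length}: LIST {list_rows * length / 1e6:.0f}M values/s, "
        f"VECTOR {vector_rows * length / 1e6:.0f}M values/s, "
        f"speedup {vector_rows / list_rows:.1f}x"
    )
```

Per value, LIST write lands at roughly 250M to 275M floats/s and VECTOR write at roughly 1.7G to 1.9G floats/s across all three lengths, which is consistent with the near-constant write speedup of about 7x.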

Realistic embedding row: (int64 id, timestamp ts, int32 category, float score, string label, FSL<float, N> embedding)

| Vector length | LIST write | VECTOR write | Write speedup | LIST read | VECTOR read | Read speedup | LIST roundtrip | VECTOR roundtrip | Roundtrip speedup |
|---|---|---|---|---|---|---|---|---|---|
| 768 | 347.58k rows/s | 2.37M rows/s | 6.8x | 864.73k rows/s | 11.87M rows/s | 14x | 248.52k rows/s | 1.79M rows/s | 7.2x |

@rok rok changed the title from "Vector repetition level proposal" to "Draft: [Parquet][C++] Add experimental VECTOR repetition support for Arrow FixedSizeList" on May 15, 2026
@rok rok changed the title from "Draft: [Parquet][C++] Add experimental VECTOR repetition support for Arrow FixedSizeList" to "[Parquet][C++] Add experimental VECTOR repetition level for Arrow FixedSizeList" on May 15, 2026
@rok rok marked this pull request as draft May 15, 2026 19:22
@rok rok force-pushed the vector_repetition_level branch from e12309e to d660076 Compare May 19, 2026 19:10
@rok rok force-pushed the vector_repetition_level branch from 470692a to ada6b22 Compare May 20, 2026 01:32